Abstract. Motivated by the practice of exploratory research, we
formulate an approach to multiple testing that reverses the conventional
roles of the user and the multiple testing procedure. Traditionally, the 
user chooses the error criterion, and the procedure the resulting rejected 
set. Instead, we propose to let the user choose the rejected set freely, 
and to let the multiple testing procedure return a confidence statement 
on the number of false rejections incurred. In our approach, such con- 
fidence statements are simultaneous for all choices of the rejected set, 
so that post hoc selection of the rejected set does not compromise their 
validity. The proposed reversal of roles requires nothing more than a re- 
view of the familiar closed testing procedure, but with a focus on the 
non-consonant rejections that this procedure makes. We suggest several 
shortcuts to avoid the computational problems associated with closed 
testing. 
1. INTRODUCTION



Central to the practice of statistics is the dis- 
tinction between exploratory and confirmatory data 
analysis, and the interplay between the two. Ex- 
ploratory data analysis suggests and formulates hy- 
potheses, which can subsequently be rigorously 
tested by confirmatory data analysis. The two types 
of data analysis require very different methods (Tukey,
1980): where confirmatory data analysis is structured
and rigorous, exploratory data analysis can be
open-minded and speculative. 

Hypothesis testing and strict Type I error con- 
trol are traditionally part of the realm of confir- 
matory data analysis, and, by implication, so are 
multiple testing procedures. However, multiple hy- 
pothesis testing is increasingly finding its way into 
exploratory data analysis. In genomics research, for 
example, typical experiments test thousands of hy- 
potheses corresponding to as many molecular mark- 
ers. Although somewhat structured, such experi- 
ments should be viewed as exploratory rather than 
as confirmatory. The collection of tested hypotheses 
is usually not selected on the basis of any theory, 
but because it is convenient and exhaustive. The 
rejected hypotheses are generally not meant to be 
reported as end results, but are to be followed up 
by independent validation experiments. 

Despite the exploratory nature of these experi- 
ments, researchers do feel a need for multiple hy- 
pothesis testing methods and, in fact, routinely ap- 
ply them. The main reason for this is that researchers 
want to protect themselves from following up on too 
many false leads and doing too many unsuccessful 
validation experiments. Most multiple testing methods,
however, have been designed for confirmatory
data analysis and are ill-suited for the specific re- 
quirements of exploratory research. 

Before we come to the main argument of this pa- 
per, we would like to set the scene by sketching 
the requirements for an inferential procedure for ex- 
ploratory research. Imagine the situation that we 
are exploring a large, but finite number of candi- 
date hypotheses, indiscriminately selected. Rather 
than rigorously proving the validity of some or all 
of these hypotheses, as in confirmatory analysis, we 
want to select a number of promising hypotheses for 
further probing in a next phase of validation. The 
open-minded nature of exploratory research can be 
described by three characteristics: exploratory re- 
search is mild, flexible and post hoc. We explain these 
three terms below, contrasting them with the more 
familiar characteristics of confirmatory research. 

An inferential procedure is mild if it allows some 
false positives among the selected hypotheses. This 
is the most obvious characteristic of exploratory re- 
search. Mildness is reasonable because false positives 
are expected to be detected and removed in later val- 
idation experiments. Confirmatory research, in con- 
trast, being the final phase of the research cycle, is 
not mild but strict. 

An inferential procedure is flexible if it does not 
prescribe to the researcher which precise hypothe- 
ses to select or not to select. For example, if the 
procedure ranks the hypotheses from most to least 
promising, but the researcher detects a common 
theme in the hypotheses ranked second, third and 
fourth, he or she can choose to follow up on these 
three hypotheses and disregard the hypothesis that 
ranked first. In fact, the researcher may also choose 
to follow up on the hypothesis that ranked last, if 
that fits the same theme. Such freedom, "picking 
and choosing," is an important part of the hypothe- 
sis-generating aspect of exploratory research. In con- 
firmatory research, in contrast, selection of an in- 
teresting and coherent collection of hypotheses has 
been done prior to the experiment, so that flexible 
selection is not necessary. 

Finally, an inferential procedure is post hoc if it
allows all choices that are inherent to the procedure
to be made after seeing the data. Specifically, how
mild the procedure should be, and which precise set
of hypotheses to select, do not have to be chosen
beforehand, but may be chosen on the basis of the
data. This is probably the most distinguishing feature
of exploratory research. The decision which inferences,
and how many, to follow up is often based
on a mixture of considerations; these considerations
are usually not purely statistical, and are often
difficult to make explicit. In contrast, in pure
confirmatory research all choices regarding the testing
procedure have to be set in stone before data collection.

An ideal multiple hypothesis testing procedure for 
exploratory research should sanction a mild, post 
hoc and flexible approach. Multiple testing proce- 
dures generally do not fulfil these criteria. The main 
present distinction is between multiple testing meth- 
ods based on the familywise error (FWER), and 
variants, and methods based on the false discovery 
rate (FDR), and variants of that. 

FWER-based methods control the probability of 
making any false rejection at a prespecified rate. 
These are the archetypical methods for confirma- 
tory analysis. Such methods are clearly not mild, 
and they are not post hoc, as all data analysis deci- 
sions have to be made before seeing the data. They 
can be argued to be flexible in a limited sense: it 
is possible to refrain from rejecting some of the re- 
jected hypotheses without violating control of the 
familywise error, but it is not possible to reject any 
hypotheses that were not selected by the procedure. 
A variant of familywise error, k-FWER, has been
formulated that controls the probability of making
at least k > 1 false rejections (Romano and Wolf,
2007). Depending on k, methods with this error rate
are mild and are flexible in the same limited way
as FWER itself is. Still, k-FWER-based methods
have so far only attracted theoretical interest, as in
these methods the value of k may not be chosen post
hoc, and nobody knows how to choose k a priori in
an applied setting. A recent permutation method
of Meinshausen (2006) can be seen as a method
that controls k-FWER simultaneously for all values
of k, and consequently allows post hoc selection of k. 
This method is mild, post hoc, and quite flexible, al- 
though it does not allow a fully arbitrary selection 
of the set of rejected hypotheses. 

False Discovery Rate (Benjamini and Hochberg, 
1995) methods control the expected proportion of 
falsely rejected hypotheses among the rejected hy- 
potheses. Such methods are not very well suited for 
traditional confirmatory research and take a step 
toward exploratory research. FDR-based methods 
are certainly mild compared to FWER-based meth- 
ods. However, they are not post hoc, as the set of 
rejected hypotheses is completely determined after 
setting the FDR threshold. Moreover, FDR-based 
methods are not flexible: as shown by Finner and 
Roters (2001), and illustrated in a practical example
by Marenne et al. (2009), selecting a subset from
the hypotheses that the FDR-controlling procedure 
rejects may increase the false discovery rate above 
the prespecified level, just like, of course, selecting 
a superset can. Many variants of FDR have been 
proposed (e.g., Storey, 2002; Efron et al., 2001; Van 
Der Laan, Dudoit and Pollard, 2004), but none of 
these has the desired three characteristics of the 
ideal multiple testing procedure for exploratory in- 
ference. Methods have been formulated for selective 
inference (Benjamini and Yekutieli, 2005), but these 
still do not allow the full flexibility of exploratory se- 
lection. 

In this paper we present an approach to multiple 
testing that does allow mild, flexible and post hoc 
inference. By the nature of the requirements of be- 
ing flexible and post hoc, such a procedure cannot 
prescribe what hypotheses to reject, but can only 
advise. This reverses the traditional roles of the user 
and the procedure in multiple testing. Rather than, 
as in FWER- or FDR-based methods, to let the user
choose the quality criterion, and to let the procedure 
return the collection of rejected hypotheses, the user 
chooses the collection of rejected hypotheses freely, 
and the multiple testing procedure returns the as- 
sociated quality criterion. In our view, the task of 
a multiple testing procedure in the exploratory con- 
text is not to dictate what to reject, but to quantify 
the risk taken, in terms of the potential number of 
false rejections, of following up on any specific set of 
hypotheses, chosen freely. 

This reversal of roles can be achieved while avoid- 
ing the pitfall of proposing yet another variant of 
FWER or FDR; it can be done simply by com- 
bining the familiar concept of the confidence set, 
the discrete version of the confidence interval, with 
the well-known closed testing procedure (Marcus, 
Peritz and Gabriel, 1976), widely recognized as a 
fundamental principle of multiple testing. What we 
will show is that the closed testing procedure can 
be used to construct exact simultaneous confidence 
sets for the number of false rejections incurred when 
rejecting any specific set of hypotheses, measuring 
the risk of following up on this particular set of hy- 
potheses. Because the confidence sets are simulta- 
neous over all possible sets of rejected hypotheses, 
the user is free to optimize, making the procedure 
valid even under post hoc selection of the rejected 
set. 

The approach we propose is constrained by the
requirement that the number of hypotheses potentially
to be followed up is finite and that these hypotheses
can be listed a priori. While this requirement
rules out the most open-minded and unstructured
applications of exploratory research, many exploratory
problems are structured enough to fit the
framework.

Our proposed procedure has strong links to k-FWER
methods. In fact, the constructed confidence
sets can be seen as controlling the k-FWER, but
simultaneously for all values of k, thus sanctioning
post hoc selection of k and removing the requirement
of selecting k a priori, which traditionally
plagues k-FWER-based methods. Through this,
our method links to the approach of Meinshausen 
(2006); we come back to this link in Section 4.2. 

Another interesting link is with methods that have 
appeared in recent years for estimating $\pi_0$, the number
of true hypotheses among the collection of all hypotheses
(Schweder and Spjøtvoll, 1982; Benjamini
and Hochberg, 2000; Langaas, Lindqvist and Ferkingstad,
2005; Meinshausen and Bühlmann, 2005;
Jin and Cai, 2007). The procedure outlined in this
paper automatically gives a confidence set for the
quantity $\pi_0$, because the collection of all hypotheses
is one of the possible sets of rejected hypotheses that
the user can choose to follow up, and the number of
false rejections in that set is exactly $\pi_0$.

The outline of this paper is as follows. In the next 
section, we review the closed testing procedure and 
the role of the concept of consonance in that pro- 
cedure. We argue that non-consonant closed test- 
ing procedures have been underrated, and illustrate 
the type of additional inference that is possible from 
a non-consonant closed testing procedure, but typ- 
ically neglected, before we argue how these addi- 
tional inferences can be used to construct a con- 
fidence set. Section 3 applies the approach to se- 
lection of variables in a multiple regression model. 
Section 4 explores computational issues related to 
closed testing procedures and proposes situations 
in which shortcuts can be found. Finally, Section 5 
looks at estimation of the number of correctly re- 
jected hypotheses. Software to perform the meth- 
ods described in this paper is available in the cherry 
package, downloadable from CRAN. 

2. NON-CONSONANT CLOSED TESTING 

The closed testing procedure (Marcus, Peritz and 
Gabriel, 1976) is well known as a cornerstone of fam- 
ilywise error control. In this section we show how 
closed testing may also be used to construct con- 
fidence sets for the number of falsely rejected hy- 
potheses. 
Fig. 1. Intersection hypotheses formed by elementary hypotheses
$H_1$, $H_2$ and $H_3$. Rejected hypotheses have been
marked with a cross. The rejection of $H_2 \cap H_3$ is a
non-consonant rejection.

First we introduce some notation. Let $H_1, \ldots, H_n$
be the collection of hypotheses of interest, the elementary
hypotheses, out of which we want to select
hypotheses to follow up. Some of these hypotheses
are true; let $T \subseteq \{1, \ldots, n\}$ denote the unknown
indices of true hypotheses. To use a closed testing
procedure, we must consider not only the elementary
hypotheses, but also all intersection hypotheses
of the form $H_I = \bigcap_{i \in I} H_i$, where $I \subseteq \{1, \ldots, n\}$,
$I \neq \emptyset$. Figure 1 illustrates the intersection hypotheses
formed by three hypotheses $H_1$, $H_2$ and $H_3$ in
the form of a graph, with arrows denoting subset
relationships (ignore the crosses for now).

An intersection hypothesis $H_I$ is true whenever all
$H_i$, $i \in I$, are true, that is, whenever $I \subseteq T$. Let the
closure $\mathcal{C}$ be the collection of all nonempty subsets
of the index set $\{1, \ldots, n\}$. Each element of $\mathcal{C}$
corresponds to an intersection hypothesis, some of which
are true. Let $\mathcal{T} = \{I \in \mathcal{C} : I \subseteq T\}$ be the subsets
corresponding to true intersection hypotheses. The
collection $\mathcal{C}$ also contains singleton sets. Noting that
we can equate $H_i = H_{\{i\}}$, let $\mathcal{H} = \{I \in \mathcal{C} : \#I = 1\}$
be the subsets corresponding to the elementary
hypotheses.

The closed testing procedure works as follows. It
requires $\alpha$-level tests for every intersection hypothesis
$H_I$, $I \in \mathcal{C}$, which are called the local tests.
Applying these local tests, let $\mathcal{U} \subseteq \mathcal{C}$ be the collection
of subsets $U \in \mathcal{C}$ for which the test rejects the
hypothesis $H_U$. The collection $\mathcal{U}$ represents the raw
rejections uncorrected for multiple testing. Based on
these raw rejections, the closed testing procedure
rejects every $I \in \mathcal{C}$ for which $J \in \mathcal{U}$ for every $J \supseteq I$.
Denote the collection of all such $I$ by $\mathcal{X}$. It was
shown very elegantly by Marcus, Peritz and Gabriel
(1976) that with this rejected set the closed testing
procedure strongly controls the familywise error for
all hypotheses $H_I$, $I \in \mathcal{C}$, at level $\alpha$. They showed
that the event $E = \{T \notin \mathcal{U}\}$, which happens with
probability at least $1 - \alpha$, implies that $\mathcal{X} \cap \mathcal{T} = \emptyset$.

In the example of Figure 1, suppose that the hypotheses
rejected by the local tests are the ones
marked with a cross. In this example $H_1$ is rejected
by the closed testing procedure because the four
hypotheses $H_1$, $H_1 \cap H_2$, $H_1 \cap H_3$ and $H_1 \cap H_2 \cap H_3$
are all rejected by their local test. In fact, in the
example of Figure 1 we have $\mathcal{X} = \mathcal{U}$, because each
hypothesis rejected by the local test has all its
ancestors in the graph of Figure 1 rejected.
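
The step from raw rejections to closed testing rejections can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nonempty_subsets` and `closed_testing` are ours, index sets are represented as frozensets, and the raw rejections are the five crossed hypotheses of Figure 1 as described in the text.

```python
from itertools import combinations

def nonempty_subsets(indices):
    """All nonempty subsets of `indices`, as frozensets (the closure C)."""
    items = sorted(indices)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def closed_testing(indices, raw_rejections):
    """Return the sets I for which every J in C with J >= I was rejected locally."""
    closure = nonempty_subsets(indices)
    raw = set(raw_rejections)
    return {I for I in closure
            if all(J in raw for J in closure if J >= I)}

# Raw (local test) rejections of Figure 1: the hypotheses marked with a cross.
U = {frozenset(s) for s in [{1}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]}
X = closed_testing({1, 2, 3}, U)
print(X == U)  # in this example every raw rejection survives
```

The brute-force loop over all supersets is chosen for readability and is only workable for a handful of hypotheses.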

When using the closed testing procedure for familywise
error control, the intersection hypotheses are
generally constructed for the benefit of the procedure,
but are not of genuine interest by themselves.
The reported result of the procedure is therefore
usually not the collection $\mathcal{X}$, but only $\mathcal{X} \cap \mathcal{H}$. From
the perspective of familywise error control, a rejection
$I \in \mathcal{X}$ for which there is no $J \in \mathcal{X} \cap \mathcal{H}$ with
$J \subseteq I$ is a wasted rejection. Such a rejection was not
instrumental in facilitating a rejection of interest; if
that rejection had not occurred, the same rejected
set $\mathcal{X} \cap \mathcal{H}$ of elementary hypotheses would have
resulted from the procedure. This consideration has
led to a quest for consonant closed testing procedures.
A closed testing procedure is consonant if the
local tests for every $I \in \mathcal{C}$ are chosen in such a way
that rejection of $I$ implies rejection of at least one
$J \in \mathcal{H}$. It is easily shown that for every closed testing
procedure there is a consonant procedure that
rejects at least as much in $\mathcal{X} \cap \mathcal{H}$. Moving from a
non-consonant to a consonant procedure may often lead
to a gain in power on the elementary hypotheses.
From a familywise error perspective, consonance is,
therefore, a desirable property, and non-consonant
procedures are best avoided (Bittman et al., 2009).

However, once we are interested in milder inference
than a familywise error-based one, the premise
that only rejection of the elementary hypotheses
$H_1, \ldots, H_n$ is of interest should be dropped, and
non-consonant closed testing procedures need not be
avoided. We illustrate this with the simple example
of Figure 1, which will immediately serve as a small
showcase of the point of view on multiple testing we
propose in this paper. Here, the only one of the
elementary hypotheses that has been rejected is $H_1$.
Of the intersection hypotheses we see three "consonant"
rejections, namely $H_1 \cap H_2 \cap H_3$, $H_1 \cap H_2$
and $H_1 \cap H_3$, which have all facilitated rejection of
the elementary hypothesis $H_1$. We also see one
"non-consonant" rejection, $H_2 \cap H_3$. A familywise error
perspective would dictate rejection of $H_1$ and nothing
else. An exploratory perspective on the same data,
however, would leave the choice of which and how many
hypotheses to reject to the user. An obstinate user
could, for example, choose not to reject $H_1$, but to
reject $H_2$ and $H_3$. What can we say about the risk
incurred by such a user in terms of the number of
false rejections?

In general, the number of false rejections made
when rejecting the hypotheses $H_i$, $i \in R$, is equal to
$\tau(R) = \#(T \cap R)$, the number of true null hypotheses
among $R$. For a given set $R$, this quantity is
just a function of the model parameters, for which
we can find estimates and confidence intervals just
like for any other function of the model parameters.
The confidence interval takes the form of a confidence
set, because $\tau(R)$ only takes discrete values.
We come to the issue of estimation later, and first
construct such a confidence set.

To construct a confidence set, define

$\mathcal{C}_R = \{I \in \mathcal{C} : I \subseteq R\}$,

the collection of all intersection hypotheses involving
only rejected hypotheses, and let

$t_\alpha(R) = \max\{\#I : I \in \mathcal{C}_R, I \notin \mathcal{X}\}$,

taking $t_\alpha(R) = 0$ if $\mathcal{C}_R \subseteq \mathcal{X}$. The quantity $t_\alpha(R)$ is
the size of the largest subset of $R$ for which the
corresponding intersection hypothesis is not rejected
by the closed testing procedure. We claim that the
set

(1) $\{0, \ldots, t_\alpha(R)\}$

is a $(1 - \alpha)$-confidence set of the parameter $\tau(R)$.

To prove the coverage of this set, remember that
if the event $E$ has happened, then all rejections
that the closed testing procedure has made are correct.
Given that $E$ has happened, the value of $\tau(R)$
cannot be greater than the value of $t_\alpha(R)$, because
otherwise a true intersection hypothesis would have
been rejected, namely $H_{T \cap R}$, whose index set is a
subset of $R$ of size larger than $t_\alpha(R)$. This is
inconsistent with the definition of $E$. Consequently,
$\tau(R) \in \{0, \ldots, t_\alpha(R)\}$ with probability at least
$P(E) \geq 1 - \alpha$, which makes $\{0, \ldots, t_\alpha(R)\}$ a
$(1 - \alpha)$-confidence set for $\tau(R)$.
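
The definition of $t_\alpha(R)$ translates directly into a search over the subsets of $R$, largest first. The sketch below is a naive illustration, not the implementation of any package; the function name `t_alpha` and the frozenset representation are our own choices, and the rejected collection is the closed testing result of Figure 1.

```python
from itertools import combinations

def t_alpha(R, rejected):
    """Size of the largest nonempty subset of R that is not in `rejected`.

    Returns 0 when every nonempty subset of R has been rejected.
    """
    items = sorted(R)
    for size in range(len(items), 0, -1):       # try the largest subsets first
        for subset in combinations(items, size):
            if frozenset(subset) not in rejected:
                return size
    return 0

# Closed testing rejections of Figure 1.
X = {frozenset(s) for s in [{1}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]}

for R in [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]:
    bound = t_alpha(R, X)
    print(sorted(R), "confidence set for tau(R):", list(range(bound + 1)))
```

Because the same collection of rejections is reused for every $R$, evaluating many candidate sets costs nothing in terms of coverage.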

The confidence set (1) is always one-sided, never
providing a nontrivial lower bound for $\tau(R)$. The
reason for this is that the confidence set originates
from a procedure that is focused on rejecting, not on
accepting null hypotheses. Furthermore, for many
applications the null hypotheses are point hypotheses,
of which it can never be proved that they are
true. In these cases, no procedure can produce a
confidence interval with a nontrivial lower bound, and
the upper bound is the only bound of real interest.

Table 1

Confidence sets for the numbers of incorrect rejections $\tau(R)$
and correct rejections $\phi(R)$ incurred with various choices of
the rejected set, based on the closed testing result of Figure 1

R          Confidence set for $\tau(R)$   Confidence set for $\phi(R)$

{1}        {0}                       {1}
{2}        {0, 1}                    {0, 1}
{3}        {0, 1}                    {0, 1}
{1,2}      {0, 1}                    {1, 2}
{1,3}      {0, 1}                    {1, 2}
{2,3}      {0, 1}                    {1, 2}
{1,2,3}    {0, 1}                    {2, 3}

Often interest is in quantifying not the number
of true hypotheses in $R$, but the number of false
hypotheses $\phi(R) = \#R - \tau(R)$. A confidence set for
$\phi(R)$ follows from (1) immediately as

$\{f_\alpha(R), \ldots, \#R\}$,

where $f_\alpha(R) = \#R - t_\alpha(R)$. Confidence sets for other
quantities that depend only on $\tau(R)$ and $\#R$, such
as the false discovery proportion $\tau(R)/\#R$, may be
derived in a similar way.
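
As a small worked illustration of these derived quantities, the following sketch turns a value of $t_\alpha(R)$ into the confidence set for $\phi(R)$ and an upper confidence bound for the false discovery proportion. The helper name `derived_bounds` is hypothetical, and the input values are those of $R = \{1, 2, 3\}$ in the Figure 1 example.

```python
def derived_bounds(size_R, t_alpha_R):
    """Confidence statements implied by {0, ..., t_alpha(R)} for a nonempty set R.

    size_R    : #R, the number of hypotheses in R (must be at least 1)
    t_alpha_R : the upper confidence bound t_alpha(R) on tau(R)
    """
    f_alpha_R = size_R - t_alpha_R                 # lower bound on correct rejections
    phi_set = list(range(f_alpha_R, size_R + 1))   # {f_alpha(R), ..., #R}
    fdp_upper = t_alpha_R / size_R                 # upper bound on tau(R) / #R
    return phi_set, fdp_upper

# R = {1, 2, 3} in the Figure 1 example: #R = 3 and t_alpha(R) = 1.
phi_set, fdp_upper = derived_bounds(3, 1)
print(phi_set)     # [2, 3]
print(fdp_upper)   # one third
```

Both statements hold on the same event $E$ as the confidence set for $\tau(R)$, so they share its simultaneous coverage.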

Returning to the example of Figure 1 with choice
of a rejected set $R = \{2, 3\}$, we have a realized value
of $t_\alpha(R) = 1$. We conclude that $\{0, 1\}$ is a
$(1 - \alpha)$-confidence set for the number of false rejections
incurred when rejecting $H_2$ and $H_3$. Even though neither
$H_2$ nor $H_3$ was rejected by the closed testing
procedure, when rejecting both $H_2$ and $H_3$ the user
can be confident of making at least one correct rejection.
The choice of $R = \{2, 3\}$ is only one of many
possible rejection choices that the user can make.
For each alternative choice, a confidence set can be
made in the same way as for $R = \{2, 3\}$. These
confidence sets, and the corresponding confidence sets
for $\phi(R)$, are given in Table 1.

The important thing to note about confidence sets
of the form (1) is that they are simultaneous
confidence sets, which all depend on exactly the same
event $E$ for their coverage. Because these confidence
sets are simultaneous, the user can review all these
confidence sets, and select the rejected set $R$ that
he or she likes best, while still keeping correct $1 - \alpha$
coverage of the selected confidence set: under the
event $E$, all confidence sets cover the true parameter
simultaneously, and therefore, under the same
event $E$, the selected confidence set covers the true
parameter. Consequently, the selected confidence set
has coverage of at least $P(E) \geq 1 - \alpha$. The
simultaneity of the sets makes their coverage robust against
post hoc selection.

In the specific case of Table 1, the user might 
choose to follow up on all three hypotheses, which 
would give him or her confidence in at least two 
discoveries of a false null hypothesis. On the other 
hand, if sufficient funds are available for only two 
validation experiments, the user may choose to fol- 
low up on any two hypotheses, any pair giving con- 
fidence of obtaining at most one false positive. 

Contrary to the application of closed testing for
familywise error control, in terms of confidence sets
non-consonant rejections do improve the results
obtained from the procedure. Without the rejection of
$H_2 \cap H_3$ in Figure 1 the confidence sets for $R = \{2, 3\}$
and for $R = \{1, 2, 3\}$ would have been larger than
the ones given in Table 1. From the definition of
consonance it follows immediately that the value
of $t_\alpha(R)$ in a consonant closed testing procedure is
equal to the number of hypotheses in $R$ that are
not rejected by the closed testing procedure under
a familywise error regime. In non-consonant closed
testing procedures, the value of $t_\alpha(R)$ can be
substantially smaller, as we shall see in examples
below.
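
The effect of the single non-consonant rejection in Figure 1 can be checked numerically. The sketch below compares $t_\alpha(R)$ with and without the rejection of $H_2 \cap H_3$; the variant without it is a hypothetical consonant counterpart constructed only for illustration, and the helper names are ours.

```python
from itertools import combinations

def t_alpha(R, rejected):
    """Largest size of a nonempty subset of R not in `rejected` (0 if none)."""
    items = sorted(R)
    sizes = [len(c)
             for r in range(1, len(items) + 1)
             for c in combinations(items, r)
             if frozenset(c) not in rejected]
    return max(sizes, default=0)

def unrejected_elementary(R, rejected):
    """Number of elementary hypotheses in R that were not rejected."""
    return sum(1 for i in R if frozenset({i}) not in rejected)

# Figure 1 result, and a hypothetical variant lacking the non-consonant rejection.
with_nc = {frozenset(s) for s in [{1}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]}
without_nc = with_nc - {frozenset({2, 3})}

for R in [{2, 3}, {1, 2, 3}]:
    print(sorted(R),
          "non-consonant:", t_alpha(R, with_nc),
          "consonant:", t_alpha(R, without_nc),
          "unrejected elementary:", unrejected_elementary(R, without_nc))
```

In the consonant variant the bound coincides with the count of unrejected elementary hypotheses, while the non-consonant rejection lowers it from 2 to 1 for both sets.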

Essentially, the example of Table 1 summarizes 
the confidence set approach to multiple testing. The 
user has unlimited options in selecting what to re- 
ject, and may review all options and their conse- 
quences in order to make his or her choice. This ap- 
proach fulfills all three criteria set for multiple test- 
ing in exploratory research formulated in the intro- 
duction. The procedure is flexible, because it does 
not prescribe any rejections but leaves the choice 
which hypotheses to follow up completely in the 
hands of the user. The procedure is mild, because it 
allows any number or proportion of false rejections 
that the user desires. Furthermore, the procedure is 
post hoc, because it allows the user to review the 
consequences, in terms of the potential number of 
false rejections, of any choice of rejected hypothe- 
ses before making a final choice, without compro- 
mising the quality of the inferences obtained. Still, 
even with the lenience of all these properties, the 
inferential statements resulting from the procedure 
are absolutely classical and rigorous, requiring no 
new definitions of error rates but only the classical 
concept of simultaneous confidence sets. 



Table 2

Uncorrected p-values (t-test) for relevance of variables in the
full model and selected model

Covariate      Full model   Selected model

(Intercept)    0.036        0.000
Forearm        0.061        0.000
Biceps         0.755
Chest          0.420
Neck           0.518
Shoulder       0.905
Waist          0.000        0.000
Height         0.033        0.005
Calf           0.303
Thigh          0.351        0.036
Head           0.105


3. EXAMPLE: SELECTING COVARIATES IN REGRESSION

One area of statistics in which common practice 
is highly exploratory and post hoc is the selection 
of covariates in a multiple regression. Methods such 
as forward or backward selection, or their combina- 
tion, are typically used to select a model containing 
a subset of a set of candidate covariates. Often, p- 
values that are reported for the selected covariates 
completely ignore the selection process. The confi- 
dence set method outlined in the previous section 
can be used in this situation to set confidence limits
on the number of selected variables that are truly
associated with the response variable.

As an example, consider the physical dataset (Lar- 
ner, 1996), in which 10 physical measurements on 22 
male subjects (length, and circumference of various 
parts of the body) are used as covariates for mod- 
eling body mass. An analysis based on a linear re- 
gression model with a forward-backward algorithm 
selects the four covariates forearm, waist, height and 
thigh as the relevant variables. Table 2 gives the p- 
values of the covariates in both the full and the se- 
lected model. The reported p-values of the selected 
model are known to be anti-conservative as they do 
not take the selection into account. An important 
question to ask, therefore, is how many truly rele- 
vant variables are, in fact, included in this selection. 
This would give a measure of confidence for the se- 
lected set. 

Following the strategy outlined in the previous
section, we construct a linear regression model with
an intercept and 10 regression coefficients $\beta_1, \ldots, \beta_{10}$,
and define the elementary hypotheses $H_i$, $i = 1, \ldots, 10$,
to be the hypotheses that the corresponding
regression coefficient $\beta_i = 0$. Next we construct all
1023 intersection hypotheses $H_I$, $I \in \mathcal{C}$, each of which
corresponds to the hypothesis that $\beta_j = 0$ for all
$j \in I$. As the local tests we choose the $F$-test of
the corresponding null model against the saturated
model, tested at level $\alpha = 0.05$.

The closed testing procedure rejects 626 out of
the 1023 hypotheses, among which there is one
elementary hypothesis: waist. Several non-consonant
rejections have occurred. We can summarize these
by finding the defining rejections, that is, the
rejections $I \in \mathcal{X}$ which have no rejected proper subset
$J \subset I$, $J \in \mathcal{X}$. For this dataset, these defining
rejections are the following seven sets:

{waist} 

{forearm, neck, shoulder, height} 

{forearm, biceps, shoulder, calf} 

{forearm, shoulder, height, calf} 

{forearm, biceps, chest, neck, shoulder, thigh} 

{forearm, shoulder, height, thigh} 

{forearm, calf, thigh} 
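
Defining rejections are the minimal elements of $\mathcal{X}$ under set inclusion. Since the 626 rejections of the regression example are not listed here, the sketch below illustrates the idea on the small closed testing result of Figure 1 instead; the function name `defining_rejections` is our own.

```python
def defining_rejections(rejected):
    """Rejected index sets that have no rejected proper subset."""
    return {I for I in rejected
            if not any(J < I for J in rejected)}   # J < I means proper subset

# Closed testing rejections of Figure 1.
X = {frozenset(s) for s in [{1}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]}

print(sorted(sorted(I) for I in defining_rejections(X)))  # [[1], [2, 3]]
```

Because a superset of a rejected set is itself rejected in closed testing, the defining rejections are enough to recover the whole collection $\mathcal{X}$.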

As each of these sets corresponds to a rejected 
intersection hypothesis, we can conclude with 95% 
confidence that each of the seven sets contains at 
least one truly relevant covariate. It is tempting to 
say that, besides waist, forearm must be relevant,
since it is included in all defining sets except the
first. This is not warranted, however, as the sets are
also consistent with alternative truths, such as that
both shoulder and thigh are relevant variables. What
we can conclude is that if we select, for example, the
set 

(2) R = {waist, forearm, calf, thigh} 

we have selected at least two relevant variables. Fur- 
thermore, we can also directly conclude that waist 
is a relevant variable. 

Coming back to the set $R$ = {waist, forearm,
height, thigh} selected by the forward-backward
procedure, we can find all 15 intersection hypotheses of
the four hypotheses in $R$ and check whether they
were rejected by the closed testing procedure. We
find that $R \in \mathcal{X}$, but {forearm, height, thigh} $\notin \mathcal{X}$,
so that $t_\alpha(R) = 3$. Therefore, we can say with
confidence that the selected set $R$ contains at least one
truly relevant covariate, but not that it contains more
than one. From this result, it is clear that the
p-values given for the selected model in Table 2 are
highly untrustworthy.



To find out how many of the original 10 hypotheses
are relevant, we take $R$ to be the full set of 10
hypotheses, and we calculate $t_\alpha(R) = 8$ for this set.
Apparently, we can conclude that there are at least
two covariates among these 10 that are determinants
of mass. The smallest set that contains at least two
relevant covariates is the set (2).
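
The values of $t_\alpha(R)$ quoted in this section can be rechecked from the seven defining rejections alone, because an intersection hypothesis is rejected by closed testing exactly when its index set contains one of the defining sets. The following sketch does this by brute force; the function names are ours, and the only data used are the seven sets listed above and the covariate names of Table 2.

```python
from itertools import combinations

# The seven defining rejections reported for the physical dataset.
DEFINING = [frozenset(s) for s in [
    {"waist"},
    {"forearm", "neck", "shoulder", "height"},
    {"forearm", "biceps", "shoulder", "calf"},
    {"forearm", "shoulder", "height", "calf"},
    {"forearm", "biceps", "chest", "neck", "shoulder", "thigh"},
    {"forearm", "shoulder", "height", "thigh"},
    {"forearm", "calf", "thigh"},
]]

def is_rejected(I):
    """I is in X exactly when it contains at least one defining rejection."""
    return any(D <= I for D in DEFINING)

def t_alpha(R):
    """Size of the largest nonempty subset of R that is not rejected (0 if none)."""
    items = sorted(R)
    for size in range(len(items), 0, -1):
        if any(not is_rejected(frozenset(c)) for c in combinations(items, size)):
            return size
    return 0

ALL = {"forearm", "biceps", "chest", "neck", "shoulder",
       "waist", "height", "calf", "thigh", "head"}

print(t_alpha({"waist", "forearm", "calf", "thigh"}))    # 2: at least two relevant
print(t_alpha({"waist", "forearm", "height", "thigh"}))  # 3: at least one relevant
print(t_alpha(ALL))                                      # 8: at least two relevant
```

The lower bounds on the number of relevant covariates follow as $\#R - t_\alpha(R)$, giving 2, 1 and 2 for the three sets.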

It should be noted that the set selected by a variable
selection procedure should not generally be expected
to be optimal from a confidence set perspective,
because the perspectives of the two procedures
are quite different. This is best illustrated by think- 
ing of a dataset in which there are two covariates 
which are both highly correlated with each other, 
and with the response. Variable selection algorithms 
will always choose one of the two variables, disre- 
garding the second one as superfluous given the first. 
The confidence set approach, however, will empha- 
size the uncertainty of the choice between the two 
variables, and will not reject any intersection hy- 
pothesis that involves only one of the two covariates. 
To have confidence that at least one truly relevant 
covariate is included, both of the highly correlated 
variables must be selected. This reflects a difference 
in emphasis between the two approaches: variable 
selection selects optimal sets, whereas the confidence 
set approach quantifies the uncertainty inherent in 
the selection process. 

It is interesting to investigate the price of post 
hoc selection relative to a priori selection. It is im- 
mediate from the procedure that reducing the tested 
set of hypotheses to a set R a priori is at least as 
powerful as, and likely more powerful than, testing 
a larger set and selecting the same set R post hoc. 
Post hoc selection will generally result in wider con- 
fidence sets than a priori selection: this is the price 
to be paid for the risk of overfit caused by post hoc 
selection. In the example this price is surprisingly 
small. If the set R = {waist, forearm, height, thigh}
had been defined a priori as the set of hypotheses of interest, treating the remaining
covariates' regression coefficients as nuisance parameters,
the confidence set of $\phi(R)$ would improve from {1,2,3,4}
to {2,3,4}. The confidence set for $\phi(R)$ for the set $R$
defined in (2) does not change.

4. SHORTCUTS 

In its standard form, application of a closed testing
procedure requires $2^n - 1$ tests to be performed.
Smart bookkeeping can reduce this number somewhat,
especially if some intersection hypotheses high
up in the hierarchy turn out non-significant, because
if $I \notin \mathcal{X}$, then immediately $J \notin \mathcal{X}$
for every $J \subseteq I$, which saves calculation of some of
the tests. Still, even with such tricks and with high
computational power, the closed testing procedure 
becomes computationally intractable in its general 
form for a number of hypotheses around 20-30, de- 
pending on the computational effort needed for each 
single test. 
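
To make the cost concrete, the following Python sketch carries out the closed testing procedure by brute force and reads off the upper bound $t_\alpha(R)$. It is an illustration only: the function names, the toy p-values and the Bonferroni local test are assumptions made for this example and do not come from the paper.

```python
from itertools import combinations

def closed_testing(n, local_test):
    """Return the collection X of index sets rejected by closed testing.

    local_test(I) -> bool is an alpha-level local test of the
    intersection hypothesis H_I, with I a frozenset of indices.
    """
    all_sets = [frozenset(c) for k in range(1, n + 1)
                for c in combinations(range(n), k)]   # 2^n - 1 index sets
    local = {I: local_test(I) for I in all_sets}      # 2^n - 1 local tests
    # H_I is rejected if and only if every superset J of I is rejected locally.
    return {I for I in all_sets
            if all(local[J] for J in all_sets if I <= J)}

def t_alpha(R, X):
    """Size of the largest subset of R that closed testing did not reject."""
    R = sorted(R)
    for k in range(len(R), 0, -1):
        if any(frozenset(c) not in X for c in combinations(R, k)):
            return k
    return 0

# Toy p-values and a Bonferroni local test, both hypothetical.
p = [0.01, 0.04, 0.5]
alpha = 0.05

def bonferroni(I):
    return min(p[i] for i in I) <= alpha / len(I)

X = closed_testing(len(p), bonferroni)
t = t_alpha({0, 1, 2}, X)
print(sorted(sorted(I) for I in X))   # [[0], [0, 1], [0, 1, 2], [0, 2]]
print(t)                              # 2
```

The nested loop over all supersets is what becomes infeasible as $n$ grows; the shortcuts of this section avoid it for particular local tests.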

If a large number of hypotheses is to be investi- 
gated, it is, therefore, convenient if the local tests 
can be chosen in such a way that not all these tests 
need to be calculated. Methods for avoiding calcu- 
lation of some of the hypothesis tests in the closed 
testing procedure are known as shortcuts. The lit- 
erature on shortcuts in the closed testing procedure 
has been focused mainly on consonant procedures, 
and on finding the rejected individual hypotheses 
(Grechanovsky and Hochberg, 1999; Zaykin et al., 
2002; Hommel, Bretz and Maurer, 2007; Bittman 
et al., 2009; Brannath and Bretz, 2010). In this section,
we loosely extend the concept of shortcuts to
non-consonant procedures, and discuss ways of finding
$t_\alpha(R)$ in a computationally easy way for specific
choices of the local test, namely those based
on Fisher combinations, on Simes' inequality, and 
on sums of normally distributed test statistics. We 
also demonstrate how the permutation-based pro- 
cedure of Meinshausen (2006) fits into the closed 
testing framework. Finally, we touch upon the pos- 
sible use of other procedures than closed testing for 
constructing confidence sets. 

4.1 Fisher Combinations 

The case of independent null hypotheses deserves
special attention as a benchmark, because several
important multiple testing methods (Benjamini and
Hochberg, 1995; Efron et al., 2001; Storey, 2002)
have been initially formulated for independent
hypotheses only. Independent tests are, however, relatively rare
in practical applications.

One highly suitable choice for the local tests in the
independent case is Fisher's combination method.
It requires only the p-values $p_1, \ldots, p_n$ of the tests
of the elemental hypotheses $H_1, \ldots, H_n$, and rejects
an intersection hypothesis corresponding to $I \in \mathcal{C}$
whenever

$$-2 \sum_{i \in I} \log(p_i) > g_{\#I},$$

where $g_r$ is the $(1 - \alpha)$-quantile of a $\chi^2$-distribution
with $2r$ degrees of freedom. This test is a valid
$\alpha$-level test of the hypothesis $H_I$ if the p-values $p_i$,
$i \in I$, are independent. Note that the requirement
of being a valid local test only refers to intersection
null hypotheses that are true, so that there is no
requirement of independence among p-values of false
null hypotheses, nor even between p-values of true
and false null hypotheses.

Table 3

Adverse events data, taken from Herson (2009), sorted to
increasing p-values and with a typo corrected

Adverse event           p-value
Anemia                  0.02
Myocardial infarct      0.03
Diarrhea                0.04
Nausea and vomiting     0.04
Stomatitis              0.08
Skin rash               0.10
Dehydration             0.12
Shortness of breath     0.18
Renal failure           0.20
Fever                   0.23
Blurred vision          0.26
Nose bleed              0.28
Anorexia                0.30
Bronchitis              0.31
Wheezing                0.40
Headache                0.50

Fisher's method is highly non-consonant, as sum
tests often are. Moreover, the simple structure of the
local tests allows easy shortcuts to be formulated.
For any $s < \#R$, we have that $t_\alpha(R) \le s$ if

$$u(R, s+1) > \max_{0 \le j \le M} \{ g_{s+j+1} - u(\bar{R}, j) \},$$

where $u(I, k)$ is the sum of the $k$ smallest values
of $-2\log(p_i)$ with $i \in I$, $\bar{R}$ is the complement of $R$,
and $M$ is the number of values of $-2\log(p_k)$ in $\bar{R}$
smaller than the $(s+1)$th smallest value of $-2\log(p_k)$
in $R$, in agreement with the definition of $m_R$ in Appendix A.
This shortcut, related to the shortcut of Zaykin
et al. (2002), allows calculation of $t_\alpha(R)$ for
any $R$ without exponentially many tests having to
be calculated. It is an example of a general method
for finding shortcuts for exchangeable tests, which
we explain in Appendix A.

As an example, consider the following application 
in the realm of adverse drug reactions. Consider the 
data in Table 3, which give raw p-values for null 
hypotheses concerning the presence of adverse drug 
reactions reported for a certain drug. We assume the 
hypotheses to be independent, although the validity 
of this assumption can be disputed.

[Figure 2: bar chart. Horizontal axis: number of rejections (1 to 16). Legend: filled bars, correct rejections (95% conf.); open bars, others.]

Fig. 2. Number of correct rejections versus number of rejections
for the data of Table 3. The bars only give the lower
bound of the 95% confidence interval; the number of false null
hypotheses (correct rejections) is likely to be larger than indicated.

An analysis based on familywise error rate (Sidak, 
1967) or false discovery rate (procedure of Benjamini 
and Hochberg, 1995) results in no rejections for these 
data. However, among the hypotheses with small p- 
values, the researcher notices three hypotheses con- 
cerned with problems of the gastrointestinal tract: 
diarrhea, nausea and vomiting, and stomatitis. The 
researcher may hypothesize that the drug causes 
problems in this area, and may consider following up 
on these three hypotheses. For this choice of $R$ we
can calculate $f_\alpha(R) = 1$ at $\alpha = 0.05$, and we can
conclude that the drug in question has at least some
adverse effect somewhere in the gastrointestinal tract.
The researcher can be confident that following up 
on these three hypotheses will lead to at least one 
potentially successful validation experiment. 

Alternatively, if the researcher wants to optimize
the number of correct rejections, he or she may
simply wish to reject those hypotheses that have the
smallest p-values. In that case the only choice the
researcher has to make is the number of rejections,
and a plot such as Figure 2 may be made, which
plots the lower bound of the number of correct
rejections $f_\alpha(R)$ against the number of rejections $\#R$.
Based on this plot, the researcher can claim with 
95% confidence that at least five adverse drug reac- 



tions occur for this drug, and that these are found 
among the hypotheses with the 10 smallest p-values.
If the researcher does not have funds available for 10 
follow-up experiments, the researcher may want to 
validate the top six, which gives confidence of find- 
ing at least four false null hypotheses, or perhaps 
the top three, for confidence of finding at least two 
false null hypotheses. 

Figure 2 also illustrates the link between our
proposed approach and the $k$-FWER criterion. A user
wishing to control $k$-FWER can reject any set $R$
that has $t_\alpha(R) < k$, for example taking $R$ as the set
corresponding to the $i$ smallest p-values, choosing $i$
as the largest value such that $t_\alpha(R) < k$ still holds.
The graph of Figure 2 simultaneously shows the
numbers of rejections allowed with $k = 1, 2, 3, 4, \ldots$,
which are given by $i = 0, 3, 6, 7, \ldots$. A major
advantage of our approach over traditional $k$-FWER
control approaches is that control is simultaneous over
all rejected sets, and therefore over all choices of $k$.
The procedure thus bridges the gap between weak
FWER control, related to $n$-FWER, and strong
1-FWER control. Furthermore, rather than choosing $k$
in advance, its value may be picked after seeing the
data without destroying the associated control
property. The link between $k$-FWER and our approach
is not limited to local tests based on Fisher
combinations.
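
The rule for choosing $i$ given $k$ can be sketched in a few lines of Python. The function name is hypothetical. The upper bounds in `t_curve` were not recomputed from Table 3: they are the values of $t_\alpha(R)$ for the top $i = 1, \ldots, 7$ hypotheses that are implied by the numbers quoted in the text (two correct rejections among the top three, four among the top six, and $i = 0, 3, 6, 7$ for $k = 1, 2, 3, 4$), using the fact that $t_\alpha$ cannot decrease along nested sets.

```python
def largest_i_for_k(t_curve, k):
    """Largest i such that t_alpha(top-i set) < k, or 0 if there is none.

    t_curve[i - 1] holds t_alpha(R) for R = the i smallest p-values.
    """
    allowed = [i for i, t in enumerate(t_curve, start=1) if t < k]
    return max(allowed, default=0)

# Values implied by the text for i = 1, ..., 7 (an inference, see above).
t_curve = [1, 1, 1, 2, 2, 2, 3]

print([largest_i_for_k(t_curve, k) for k in (1, 2, 3)])   # [0, 3, 6]
```

Because the same curve serves every $k$, the user can inspect the whole list before deciding which $k$ to report.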

An interesting feature of using Fisher's method 
in combination with the confidence set approach is 
that the method may prove the presence of false 
null hypotheses even in the situation that no
individual p-value is smaller than $\alpha$. Consider the
following p-values, taken from Huang and Hsu (2007):

$$p_1 = 0.051; \quad p_2 = 0.064; \quad p_3 = 0.097; \quad p_4 = 0.108.$$

Even though all p-values are non-significant individually,
the confidence set for $\phi(R)$ when rejecting the
top two hypotheses is {1,2}, when rejecting the top
three hypotheses {2,3}, and when rejecting all four
hypotheses {2,3,4}. This indicates that even in the
absence of any individually significant hypotheses we
can make a rigorous confidence statement that at
least two out of the first three hypotheses are false.
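
With only four hypotheses these confidence sets can be checked by running the full closed testing procedure. The sketch below does so in Python; the function names are ours, and the chi-square tail probability is evaluated with the standard closed form for an even number of degrees of freedom, so that rejecting when the Fisher p-value is below $\alpha$ is the same as comparing the statistic with the quantile $g_r$.

```python
from itertools import combinations
from math import exp, log

def fisher_pvalue(pvals):
    """Upper tail of a chi-square with 2r d.f. at -2 * sum(log p)."""
    r = len(pvals)
    half = -sum(log(q) for q in pvals)     # half of the Fisher statistic
    term, total = 1.0, 1.0
    for k in range(1, r):                  # sum of half^k / k!, k = 0..r-1
        term *= half / k
        total += term
    return exp(-half) * total

def f_alpha(R, p, alpha=0.05):
    """Lower confidence bound for the number of false hypotheses in R."""
    R = frozenset(R)
    n = len(p)
    subsets = [frozenset(c) for k in range(1, n + 1)
               for c in combinations(range(n), k)]
    local = {I: fisher_pvalue([p[i] for i in I]) < alpha for I in subsets}
    rejected = {I for I in subsets
                if all(local[J] for J in subsets if I <= J)}
    t = max((len(I) for I in subsets if I <= R and I not in rejected),
            default=0)
    return len(R) - t

p = [0.051, 0.064, 0.097, 0.108]
for top in (2, 3, 4):
    lower = f_alpha(range(top), p)
    print(top, list(range(lower, top + 1)))
# 2 [1, 2]
# 3 [2, 3]
# 4 [2, 3, 4]
```

The only intersection hypothesis of size two that the local test fails to reject is the one for $p_3$ and $p_4$, which is why the bound does not improve when the fourth hypothesis is added.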

Fisher's method is highly non-consonant, and can
be very powerful, especially if there are many
moderately small p-values. It is not uniformly more
powerful than other tests, however. Compared to
consonant local tests, such as Sidak's, Fisher's method
tends to have smaller values of $t_\alpha(R)$ for large
rejected sets $R$ due to its large number of non-consonant
rejections, but Sidak's method often has more
sets $R$ which have $t_\alpha(R) = 0$, due to a higher number
of consonant rejections.

4.2 Simes Type Local Tests and Permutations 

A different type of local test with potentially
non-consonant rejections is a type of test that rejects
a hypothesis $H_I$, $I \in \mathcal{C}$, whenever

$$p^I_{(i)} \le c^{\#I}_i \tag{3}$$

for at least one $1 \le i \le \#I$, where $p^I_{(i)}$ is the $i$th
smallest among the p-values $\{p_j\}_{j \in I}$ of the
elementary hypotheses with indices in $I$, and $c^m_i$, $1 \le m \le n$,
$1 \le i \le m$, are appropriately chosen critical values.
Without loss of generality we can take $c^m_i \le c^m_j$ if
$i \le j$.

We call local tests of the form (3) Simes type local
tests because if we choose

$$c^m_i = \frac{i\alpha}{m}, \tag{4}$$

the test based on (3) is a valid $\alpha$-level test of $H_I$
by Simes' (1986) inequality. Simes' inequality holds
whenever p-values of true null hypotheses are inde- 
pendent, but also under more general conditions, as 
investigated by Sarkar (1998). In particular it holds 
for p-values from identically distributed, nonnega- 
tively correlated test statistics. 

A variant of Simes' inequality has been proposed
by Hommel (1983). This variant uses critical values

$$c^m_i = \frac{i\alpha}{m K_m}, \tag{5}$$

where $K_m = \sum_{v=1}^{m} v^{-1}$. Unlike the one based on
Simes' inequality, the local test defined by these
critical values is of the correct level $\alpha$ whatever the
dependence structure of the original p-values.
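
The two sets of critical values and the local test (3) translate directly into code. The Python sketch below uses hypothetical function names and made-up p-values; it only illustrates that the Hommel values (5) are smaller than the Simes values (4) by the factor $K_m$, so that the corresponding test can fail to reject where the Simes test rejects.

```python
def simes_crit(i, m, alpha):
    """Critical value (4): i * alpha / m."""
    return i * alpha / m

def hommel_crit(i, m, alpha):
    """Critical value (5): i * alpha / (m * K_m), K_m = 1 + 1/2 + ... + 1/m."""
    K_m = sum(1.0 / v for v in range(1, m + 1))
    return i * alpha / (m * K_m)

def simes_type_test(pvals, crit, alpha):
    """Local test (3): reject if the i-th smallest p-value is at most
    c_i^m for at least one i, where m is the number of p-values."""
    m = len(pvals)
    return any(q <= crit(i, m, alpha)
               for i, q in enumerate(sorted(pvals), start=1))

pvals = [0.012, 0.02, 0.3]    # made-up p-values, for illustration only
print(simes_type_test(pvals, simes_crit, 0.05))    # True
print(simes_type_test(pvals, hommel_crit, 0.05))   # False
```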

Local tests of the form (3) do not generally allow
shortcuts for the calculation of $t_\alpha(R)$, but two
useful shortcuts are available if the critical values are
chosen in such a way that

$$c^l_i \le c^m_i \quad \text{whenever } l \ge m. \tag{6}$$



The first shortcut this condition allows is the
general shortcut described in Appendix A, the
conditions of which are fulfilled whenever (6) holds. The
second shortcut is even faster to calculate, but is
less general: it holds for rejected sets of the form
$R = \{i : p_i \le q\}$ only. Let $p_{(i)}$, $i = 1, \ldots, n$, be short
for $p^I_{(i)}$ with $I = \{1, \ldots, n\}$. For $R$ of the form
mentioned, we have the shortcut

$$f_\alpha(R) > \max\{S_r : 1 \le r \le \#R\}, \tag{7}$$

where $S_r = \max\{s \ge 0 : p_{(r)} \le c^n_{r-s}\}$. The value of $S_r$
can be interpreted as the number of more stringent
critical values $c^n_{r-1}, \ldots, c^n_1$ by which the p-value $p_{(r)}$
overshoots its mark $c^n_r$. The number of false
hypotheses is larger than the greatest such overshoot of
the ordered p-values in the set $R$. The shortcut (7) is
useful for making plots such as the one in Figure 2.
We prove this shortcut in Appendix B. A slightly 
more powerful variant of the shortcut (7) is avail- 
able if we have 



(8) 



<C r ! 



for every 1 < w < i. 



In this case, we have the same shortcut as (7), but
with $S_r = \max\{s \ge 0 : p_{(r)} \le c^{n-s}_{r-s}\}$. The proof of this
statement is analogous to the proof for (7) and is
also given in Appendix B. It is easy but tedious to
show that the Simes critical values (4) and (5)
conform to (6) and that the critical values (4) also
conform to the stronger (8), so that the shortcuts may
be used for these choices of the local test (see also
Benjamini and Heller, 2008).
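
As an illustration, the shortcut (7) and its variant can be sketched in Python for the Simes critical values (4) and the p-values of Table 3. The function names are hypothetical. Two readings are assumed here: $S_r$ is taken as the largest $s \ge 0$ with $p_{(r)} \le c^n_{r-s}$ (or $c^{n-s}_{r-s}$ in the variant), and when no such $s$ exists the ordered p-value simply contributes nothing to the bound.

```python
def simes_crit(i, m, alpha):
    """Simes critical value (4)."""
    return i * alpha / m

def shortcut_lower_bound(p, size, alpha, crit=simes_crit, variant=False):
    """Lower bound on f_alpha(R) for R = the `size` smallest p-values.

    variant=False uses c^n_{r-s} as in (7); variant=True uses
    c^{n-s}_{r-s}, which the text allows when (8) holds.
    """
    n = len(p)
    ordered = sorted(p)
    best = -1                          # no overshoot found so far
    for r in range(1, size + 1):
        for s in range(r):             # r - s must stay a valid index
            m = n - s if variant else n
            if ordered[r - 1] <= crit(r - s, m, alpha):
                best = max(best, s)
    return best + 1                    # f > S_r means f >= S_r + 1

# p-values of Table 3
p = [0.02, 0.03, 0.04, 0.04, 0.08, 0.10, 0.12, 0.18,
     0.20, 0.23, 0.26, 0.28, 0.30, 0.31, 0.40, 0.50]

print(shortcut_lower_bound(p, 16, 0.05))                # 0
print(shortcut_lower_bound(p, 16, 0.5))                 # 5
print(shortcut_lower_bound(p, 16, 0.5, variant=True))   # 16
```

At $\alpha = 0.05$ the bound is zero for the full set, in line with the Simes local test making no rejections for these data. At $\alpha = 1/2$ the variant attains the full 16, which agrees with the estimate reported for the Simes local test in Section 5.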

As a side note, we remark that the critical
values (4) and (5) are the same as the critical
values used in the false discovery rate controlling
algorithms of Benjamini and Hochberg (1995) and
Benjamini and Yekutieli (2001), respectively. The
correspondence between the critical values creates a
connection between the corresponding methods. The set
that has been rejected by the false discovery rate
algorithm always has $f_\alpha(R) > 0$ based on the closed
testing procedure with the corresponding local test.
Note that the assumptions underlying each local test 
and its corresponding false discovery rate algorithm 
are very similar. For the example data of Section 4.1, 
the Simes local test leads to no rejections, which is 
consistent with finding no rejections with the proce- 
dure of Benjamini and Hochberg (1995). 

Permutation testing can be a powerful tool to take
into account the joint distribution of the p-values.
Useful shortcuts in a closed testing procedure with
permutation-based local tests can be constructed
from the work of Meinshausen and Bühlmann (2005)
and Meinshausen (2006). These authors describe
a permutation-based way to find critical values $k_i$,
$i = 1, \ldots, n$, such that the probability under the
complete null hypothesis that $p_{(i)} \le k_i$ for at least one $i$
is bounded by $\alpha$. The same method may in principle
also be used to find corresponding permutation
critical values $k^I_i$ for every intersection hypothesis $H_I$,
$I \in \mathcal{C}$, and therefore a local test for every
intersection hypothesis $H_I$; a closed testing procedure can
be made on the basis of these tests. However,
unless the number of hypotheses is limited, this will
be extremely time-consuming, and it would lead to
a closed testing procedure for which shortcuts are
not available. A way out of this dilemma can be
found by remarking that, by construction of the
permutation critical values, we have $k^I_i \ge k_i$ for every $i$
and $I$. Therefore, a valid, though conservative, local
test may be constructed by simply using a procedure
of the form (3) with $c^m_i = k_i$ for every $1 \le m \le n$.
This local test fulfils the condition (6) and therefore
admits shortcuts. With this choice of a local test, the
confidence set approach to multiple testing can be
used for every collection of test statistics for which
permutation is possible, opening up the possibility
to use permutation-based closed testing in genomics
research.

Meinshausen (2006) constructed simultaneous
confidence bands for the number of falsely rejected
hypotheses for rejected sets of the form $R = \{i : p_i \le q\}$,
similar to Figure 2, based on the permutation
critical values $k_1, \ldots, k_n$ he found. Even though
Meinshausen did not use closed testing, these confidence
bands are identical to the confidence bounds that
would be obtained when using the local tests (3)
with $c^m_i = k_i$ in combination with the shortcut (7).
By exploiting the shortcut of Appendix A rather
than this simpler shortcut, it becomes possible to
extend Meinshausen's method to be able to find
confidence bounds for $\tau(R)$ for sets $R$ not of the form
$R = \{i : p_i \le q\}$. Alternatively, for a very small
number of tests, the full permutation-based closed
testing procedure may be used, which could be more
powerful.

4.3 Normally Distributed Test Statistics 

Workable local tests may also be constructed on
the basis of normally distributed scores. Consider
the situation that we have scores $z_1, \ldots, z_n$ for the
hypotheses $H_1, \ldots, H_n$, respectively, which are
standard normally distributed if their respective null
hypothesis is true, and we would reject $H_i$ one-sidedly
when $z_i$ is large. This situation occurs quite
frequently in practice, at least asymptotically, for
example if we do many one-sided binomial z-tests.
A sensible choice for a test statistic for a local test
is $Z_I = \sum_{i \in I} z_i$. Consider first the case in which the
scores of true null hypotheses are independent. In
that case $Z_I$ is normally distributed with mean 0
and variance $\#I$, and we may reject $H_I$ whenever
$Z_I > \sqrt{\#I} \cdot \Phi^{-1}(1 - \alpha)$, where $\Phi$ is the standard
normal distribution function. If the scores are not
independent but only jointly normally distributed, we
have the following, more conservative result. In that
case $Z_I$ is normally distributed with mean 0 but
unknown variance. Let $\Sigma$ be the correlation matrix of
$\{z_i\}_{i \in I}$; then the variance of $Z_I$ is given by $\mathbf{1}^T \Sigma \mathbf{1}$,
where $\mathbf{1}$ is a vector of ones of length $\#I$. This
variance is bounded by $\#I$ times the largest eigenvalue
of $\Sigma$, and therefore by $(\#I)^2$. It follows that for $\alpha \le
1/2$, we may reject $H_I$ whenever $Z_I > \#I \cdot \Phi^{-1}(1 - \alpha)$.
This type of test was used by Van De Wiel, Berkhof
and Van Wieringen (2009).

Both tests are exchangeable and lead to easy
shortcuts in the sense of Appendix A. In practice, the test
for the non-independent case can be highly
conservative if used for small values of $\alpha$, unless the scores
are strongly positively correlated. One case to note,
however, is the case that $\alpha = 1/2$, when the critical
value is 0 for both the independent and the
general situation, negating the conservativeness of the
latter. This situation is relevant for the method of
Section 5.
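
The two critical values can be compared with a short Python sketch. The function name and the z-scores are hypothetical; the standard normal quantile is taken from the `statistics` module of the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def sum_test(scores, alpha, independent=True):
    """Reject H_I when the sum of the scores exceeds its critical value."""
    k = len(scores)
    z = NormalDist().inv_cdf(1 - alpha)       # standard normal quantile
    scale = sqrt(k) if independent else k     # sd of the sum, or its bound
    return sum(scores) > scale * z

scores = [2.0, 2.0, 0.5, 0.5]   # hypothetical z-scores, sum equal to 5
print(sum_test(scores, 0.05, independent=True))    # True:  5 > 2 * 1.645
print(sum_test(scores, 0.05, independent=False))   # False: 5 < 4 * 1.645
```

At $\alpha = 1/2$ the quantile is zero, so both versions reduce to checking whether the sum of the scores is positive.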

4.4 Other Types of Shortcuts 

Shortcuts of the form described in the appendices 
can only be used within a restricted class of local 
tests that is calculated as an exchangeable function 
of per-hypothesis statistics. Other types of shortcuts 
may be devised for other classes of local tests in the 
future. 

A very different way to construct confidence
intervals of $\tau(R)$ while avoiding calculation of the
complete closed testing procedure is to use a different
multiple testing procedure that still allows non-con- 
sonant rejection of some intersection hypotheses. Ex- 
amples of such procedures are the tree-based testing 
procedure of Meinshausen (2008), recently improved 
by Goeman and Solari (2010), the focus level pro- 
cedure of Goeman and Mansmann (2008), and the 
gatekeeping method of Edwards and Madsen (2007). 
These procedures allow familywise error inference on 
a collection of hypotheses comprising the elementary 
hypotheses and a selection from the $2^n - 1$
intersection hypotheses, and may produce non-consonant
rejections on these intersection hypotheses. The re- 
sults of these procedures may be used as a basis for 
constructing confidence intervals in the same way as 
the results of the closed testing procedure were used 
in Section 2.

5. ESTIMATION

In addition to the confidence interval, it can some- 
times be informative to have a point estimate of the 
number of true null hypotheses among a set of in- 
terest. Estimation of the number of true null hy- 
potheses has been a subject of recent interest in 
the context of genomic data analysis, and several 
authors (Schweder and Spjøtvoll, 1982; Benjamini
and Hochberg, 2000; Langaas, Lindqvist and
Ferkingstad, 2005; Meinshausen and Bühlmann, 2005;
Jin and Cai, 2007) have proposed methods for
estimating $\tau(R)$, although for $R = \{1, \ldots, n\}$ only. The
quantity $\tau(R)$ for $R = \{1, \ldots, n\}$ is commonly
referred to as $\pi_0$.

The confidence intervals of the previous sections
are easily adapted to produce a point estimate of $\tau(R)$
for any set $R$. We propose to use as an estimate the value
$t_{1/2}(R)$, the upper bound of the confidence
interval calculated at the significance level $\alpha = 1/2$.
This estimate can be seen as a conservative median
estimate of the true quantity $\tau(R)$: by the
properties of $t_\alpha(R)$ derived in the previous sections, the value
of $\tau(R)$ exceeds $t_{1/2}(R)$ with a probability that is
bounded above by 1/2. Furthermore, this property
holds simultaneously for all $R$ by the simultaneity of
the confidence interval on which it is based, which
makes the defining property of the estimate robust
against selection of $R$.

The estimate can be used to get an impression 
where the "midpoint" of the confidence interval is. 
Applying the procedure to the physical dataset of 
Section 3 at $\alpha = 1/2$, we find the following defining
rejections: 

{waist} 

{forearm} 

{height} 

{chest, calf, thigh} 
{neck, calf, thigh} 
{thigh, head} 

The estimated number of true null hypotheses among 
all 10 hypotheses, for which the 95% confidence set 
was {0, . . . , 8}, is calculated as 6, which points to an 
estimated number of four relevant variables in the 
regression. For the set R = {waist, forearm, height, 
thigh} of four variables selected by the stepwise pro- 
cedure, the number of truly relevant variables is es- 
timated at 3 (the 95% confidence set for this quan- 
tity was {1,2,3,4}). In contrast, the set (2) which 
included with 95% confidence at least two relevant 
variables, also contains an estimated number of two 



relevant variables. The smallest rejected set that 
contains an estimated number of four relevant vari- 
ables is the set R = {waist, forearm, height, thigh, 
head} . We can report this optimized set without fear 
of overfit, because the property that the number of 
truly relevant covariates is overestimated with prob- 
ability at most 1/2 holds simultaneously for all re- 
jected sets. 

In the adverse event data of Section 4.1 the
number of true null hypotheses among the 16
hypotheses is estimated at two using a Fisher local test at
$\alpha = 1/2$. In this case, all rejections turn out to be
consonant: rejecting the 14 hypotheses with
smallest p-values leads to an estimated number of 0 false
discoveries. If we use Simes rather than Fisher for
the local test, we even obtain an estimated number
of 0 true null hypotheses among all 16 hypotheses.

We warn against using the estimate of the num- 
ber of falsely rejected hypotheses by itself, without 
the associated confidence interval. To see the danger 
of this, consider the simplest "multiple testing prob- 
lem" in which only a single null hypothesis is tested. 
The estimation procedure of this section would esti- 
mate this hypothesis as true whenever the p- value is 
greater than 1/2, and as false whenever it is smaller 
than or equal to 1/2. This seems generally too le- 
nient a conclusion to be a viable strategy, although 
it may be useful in some highly exploratory and
risk-seeking settings. In these situations, the special
status of $\alpha = 1/2$ in the shortcut of Section 4.3 may be
of interest.

6. CONCLUSION 

All exploratory research is essentially picking and 
choosing. From a large number of potential hypothe- 
ses to follow up, the researcher selects for further 
investigation those hypotheses or sets of hypotheses 
that stand out in the researcher's eyes. This selec- 
tion is made in complete freedom. The notion that 
any statistical method would dictate what the re- 
searcher should find interesting is contrary to the 
spirit of exploratory research. 

However, a well-known risk of picking and choos- 
ing is overfit, "cherry-picking." Patterns that strike 
the researcher as relevant and interesting may have 
arisen due to chance, and turn out to be false posi- 
tives in follow-up experiments. To protect a research- 
er against too many disappointments of this type, it 
is important to make a realistic assessment of the 
risk taken when following up on a certain collection 
of hypotheses.

In this paper, we have presented an approach to
multiple testing that is especially designed for the 
requirements of exploratory research, and which re- 
verses the way that multiple testing methods are 
typically used. Rather than letting the user decide 
on the error rate, and the procedure on the rejec- 
tions, we let the user decide on the rejections, and 
the procedure on the error rate. Our approach does 
not rely on the definition of any new error rates, 
and has not even required the design of a new al- 
gorithm. The approach uses the classical concept of 
the simultaneous confidence set, together with the 
equally classical closed testing procedure, although 
both in a novel way. 

The end result of the procedure is a collection of 
confidence sets for the number of falsely rejected hy- 
potheses for all possible choices of the rejected set. 
The most important property of these confidence 
sets is that they are simultaneous. This simultaneity 
protects the user of the procedure against overopti- 
mism resulting from post hoc selection of the re- 
jected set, and removes many of the problems tradi- 
tionally associated with cherry-picking from a large 
set. 

Finally, the approach is very general, and the
limits on its usability are mostly computational. The
most important assumption we make is that the 
number of hypotheses potentially to be followed up 
is finite, and that these hypotheses may be enu- 
merated before starting the experiment. Aside from 
that, the ability of the closed testing procedure to 
work with any choice of a local test makes that
procedure very flexible. Only if the number of
hypotheses becomes large do computational issues limit the
choice of local tests to those for which shortcuts are 
available. The shortcuts described in this paper al- 
ready cover a wide range of application areas. More 
and improved shortcuts are likely to be found in the 
future. 

APPENDIX A: SHORTCUTS FOR 
EXCHANGEABLE LOCAL TESTS 

We present a fairly general method for
constructing shortcuts in the closed testing procedure which
can be used for finding $t_\alpha(R)$ and are appropriate
for the methods in Section 4. This shortcut
delineates a class of local tests for which $t_\alpha(R)$ can be
calculated for any $R$ by calculating only $n^2$, rather
than $2^n$ tests. We give the shortcut for a
p-value-based method. The shortcut for methods based on
other scores (e.g., Section 4.3) is completely
analogous.



Assume that the local test is exchangeable, that is,
rejection of $H_I$, $I \in \mathcal{C}$, only depends on the set $P_I =
\{p_i\}_{i \in I}$ of raw p-values, and not on the collection $I$
itself. Let $\delta$ be the function that maps from a set
of p-values $P$ to rejection, $\delta(P) = 1$ if the collection
$P = P_I$ would lead to rejection of $H_I$, and $\delta(P) = 0$
otherwise. Further, suppose that

$$\delta(\{p_1, \ldots, p_k\}) \ge \delta(\{q_1, \ldots, q_k\}) \tag{9}$$

whenever $p_1 \le q_1, \ldots, p_k \le q_k$, and that

$$\delta(q \cup P) \ge \delta(P) \tag{10}$$

whenever $q \le \min(p \in P)$.

If these assumptions hold, it can be shown that
for any $s < \#R$,

$$\delta(Q^R_{s+1} \cup Q^{\bar{R}}_j) = 1 \quad \text{for every } j \in \{0, \ldots, m_R\} \tag{11}$$

implies $t_\alpha(R) \le s$. Here, $Q^R_{s+1}$ is the set of the $s + 1$
largest p-values of hypotheses in $R$; $Q^{\bar{R}}_j$ is the set of
the $j$ largest p-values of hypotheses not in $R$, and
$m_R$ is the number of p-values not in $R$ that are larger
than the smallest p-value in $Q^R_{s+1}$.

To show this, note that by assumptions (9) and (10),
equation (11) implies that

$$\delta(Q^R_{s+1} \cup P_I) = 1$$

for every $I \in \mathcal{C}$, and that therefore, by assumption (9),

$$\delta(P_J \cup P_I) = 1$$

for every $I \in \mathcal{C}$ and for every $J \subseteq R$ for which $\#J =
s + 1$. Consequently, $J \in \mathcal{X}$ for every $J \subseteq R$ for which
$\#J = s + 1$, so that $t_\alpha(R) \le s$ by definition.
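
A minimal Python sketch of this shortcut is given below, using Fisher's combination test of Section 4.1 as the exchangeable local test and the four p-values of Huang and Hsu (2007) quoted there. The function names are hypothetical, and returning the smallest $s$ that satisfies (11) as the bound on $t_\alpha(R)$ is our reading of how the implication is meant to be used.

```python
from math import exp, log

def fisher_delta(pvals, alpha=0.05):
    """Exchangeable local test: True if Fisher's combination test rejects."""
    half = -sum(log(q) for q in pvals)
    term, total = 1.0, 1.0
    for k in range(1, len(pvals)):     # chi-square tail, even d.f.
        term *= half / k
        total += term
    return exp(-half) * total < alpha

def t_alpha_shortcut(p, R, delta=fisher_delta):
    """Smallest s < #R for which condition (11) holds, else #R."""
    R = set(R)
    in_R = sorted((p[i] for i in R), reverse=True)     # largest first
    out_R = sorted((p[i] for i in range(len(p)) if i not in R),
                   reverse=True)
    for s in range(len(in_R)):
        Q = in_R[:s + 1]                   # the s + 1 largest p-values in R
        m_R = sum(1 for q in out_R if q > Q[-1])   # Q[-1] is min(Q)
        # Condition (11): add the j largest outside p-values, j = 0..m_R.
        if all(delta(Q + out_R[:j]) for j in range(m_R + 1)):
            return s
    return len(in_R)

p = [0.051, 0.064, 0.097, 0.108]
print(t_alpha_shortcut(p, {0, 1}))          # 1
print(t_alpha_shortcut(p, {0, 1, 2}))       # 1
print(t_alpha_shortcut(p, {0, 1, 2, 3}))    # 2
```

Each candidate $s$ needs at most $m_R + 1$ local tests, so the total number of tests stays polynomial in $n$. For these four p-values the bounds reproduce the confidence sets reported in Section 4.1.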

APPENDIX B: SHORTCUTS FOR 
SIMES-TYPE LOCAL TESTS 

Next, we prove the shortcut (7) for Simes-type
local tests. Let $R = \{i : p_i \le q\}$ be a rejected set, and
assume that condition (6) holds.

First, let $r = \#R$, and remark that $p_{(r)} \le c^n_{r-s}$, for
some $s \ge 0$, implies that

(1) $f_\alpha(R) > s$.

To see why this is true, choose any $K \subseteq R$ with
$\#K \ge r - s$ and any $J \supseteq K$. Remark that $p_{(r)} \le c^n_{r-s}$
implies that

$$p^J_{(r-s)} \le p^K_{(r-s)} \le p_{(r)} \le c^n_{r-s} \le c^{\#J}_{r-s}.$$

Consequently, $K \in \mathcal{X}$ for every $K \subseteq R$ with $\#K \ge
r - s$, so that $t_\alpha(R) < r - s$, and (1) follows. To
obtain the final statement (7), remark that $f_\alpha(R) \ge
f_\alpha(S)$ for every $R \supseteq S$, and apply the bound (1) on
all $S \subseteq R$ of the form specified.

Analogously, if (8) holds, choose $K$ and $J$ as above,
and let $\tilde{s} = \#(R \setminus J) \le s$. Then $p_{(r)} \le c^{n-s}_{r-s}$ implies
that

$$p^J_{(r-\tilde{s})} \le p_{(r)} \le c^{n-s}_{r-s} \le c^{n-\tilde{s}}_{r-\tilde{s}} \le c^{\#J}_{r-\tilde{s}},$$

noting, in the last inequality, that $\#J \le n - \tilde{s}$ and
that (8) implies (6). From this result (1) and (7)
follow as above.